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Abstract 

u 

Convexity, though extremely important in mathematical programming, has not 
drawn enough attention in the field of dynamic programming. This paper gives condi- 
tions for verifying convexity of the cost-to-go functions, and introduces an accurate, fast 
and reliable algorithm for solving convex dynamic programs with multivariate continu- 
ous states and actions, called Adaptive Convex Enveloping. This is a short introduction 
of the core technique created and used in my dissertation, so it is less formal, and misses 
I/") some parts, such as literature review and reference, compared to a full journal paper. 

Background Story 

> 

One of my friends has a small firm that designs electric vehicle battery stations. A battery 
station works like a gas station: customers arrive to replace their used batteries with full 
ones and leave, and it is the battery station's job to charge the batteries in order to serve 
new customers. However, the electricity price fluctuates during the day. The battery station 
certainly wants to charge the batteries when the price is low, but it still has to satisfy 
customer demand. I wanted to help my friend develop an optimal charging policy, which is a 
typical dynamic program, with multivariate continuous states and actions. However, current 
ADP methods are not very satisfying when solving this class of problems. For example, 
algorithms generally require finding the optimal policy under the current approximations of 
the cost-to-go functions, but the approximations are in general not convex, sometimes very 
wavy, which makes finding the optimal policy difficult in multivariate problems. Also, even 
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a method is proved to converge under certain conditions, one generally doesn't know how 
the current policy compares to the true optimal one. To better solve this class of problems, 
I devised this new method — Adaptive Convex Enveloping (A.C.E.). 

Formulation 

Let the time periods be t = 1, . . . ,T. Let x t , u t and w t be vectors that represent the state, 
the action and the random information at stage t, respectively. Let Jt(x t ), x t e X t , be the 
cost-to-go function: 

J t (x t ) =inf E[ct(x t ,ut,w t ) + J t+ i(x t+1 (x t ,u t ,w t ))} 

s.t. g t (x t ,u t ) < 0, (1) 

where Ct(x t ,u t ,w t ) is a known function of the cost in period t, g t = (gij, ■ ■ ■ ,9i t ,t) T is a set 
of constraints on the action u t , given the current state x t , and Xt+i(x t ,Ut,w t ) is the state 
transition function. 

Assumptions 

• Jt(xt) is known and convex; 

• The state domain X t , t — 1, . . . , T, is closed, bounded and convex; 

• The cost function c t with w t fixed, and the constraints g i t , i = 1, . . . , I t , are jointly 
convex on X t x R*™(«t) ; 

• If the constraints are not all linear, then the set {(x t ,u t ) \ g t (x t ,u t ) < 0} must have a 
relative interior point; 

• The transition x t +i(x t , u t , w t ) is linear, i.e., x t+ \ = Ax t + Bu t + w t , where A and B are 
known matrices; nonlinear transitions are not investigated; 

• The distribution of w t is known, and is independent with x t and u t . 
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Convexity of the Cost-to-go Functions 

A convex approximation is only desirable when the target function is convex. So first we 
have a look at the cost-to-go function. Indeed, the convexity of the cost-to-go functions is 
a direct extension of the results from Fiacco and Kyparisis' paper Convexity and Concavity 
Properties of the Optimal Value Function in Parametric Nonlinear Programming (1986). 
Note that our action u t is their decision variable x, our state x t is their parameter e, and our 
cost-to-go function Jt(x t ) is their optimal value function /*(e). 

If J t+ i(x t+ i) is convex, then for any realization of w t , Ct(x t ,u t ,w t ) + J t+ i(x t+ i(x t ,u t ,w t )) 
is jointly convex on X t x R dim ("t) by our assumptions on c t and the transition, and so is 
E[c t (x t , u t , w t ) + Jt+i(x t +i(x t ,Ut,Wt))]. Since X t is a convex set, and constraints g^ t , i = 
1, . . . , I t , are jointly convex on X t x M. dim ( Ut \ the point-to-set map R(x t ) = {u t | g t (x t ,u t ) < 
0} is convex on X t by Proposition 2.3 from Fiacco and Kyparisis (1986). Then by their 
Proposition 2.1, J t (x t ) is convex on X t . 

Since Jt(xt) is convex, the above argument implies that Jt(x t ) is convex for all t = 1, . . . , T. 
Convex Enveloping of the Cost-to-go Functions 

Since the cost-to-go functions are convex, we can use supporting hyperplanes as their (outer) 
approximations. Assume that J t+ i(x t +i) has been investigated at {x 3 t+1 }^ =l1 where we know 
the function value J t+1 (x 3 t+1 ), as well as a subgradient VJ 1+ i(^ +1 ) (with abuse of the nota- 
tion). Thus for any point x t +i, we know from convexity that 

J t+ i(x t +i) > J t+ i(x J t+1 ) + VJ t+ i(iJ + /(i i+1 - x J t+1 ), j = 1, . . . , N. 

and J t +i(x t +i) is approximated as 

Jt+i(x t +i) = max { J t +i(a^+i) + VJ t+ i(^ +1 ) T (%i - x J t+1 )} . 

J Ij.-.jiV 

We now show how to efficiently obtain J t (x t ) and VJ t (x t ) to approximate J t (x t ). 

Select a sample {w^}^ =1 from the theoretical or the empirical distribution of w t , and rewrite 
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as 





J t (x t ) =inf P(wt)[c(xt,Ut,Wt) + J t+ i(x t+1 (xt,ut,w^))} 

11.4- * * 



K 



ut 

k=l 



s.t. g t {x t ,u t )<0. (2) 

If wt is discrete and the set of it's possible values is small, we can select all of its possible 
values as {w k }^ =1 ; otherwise {w k } k=l will have to be a sample from the distribution. The 
error introduced by this sampling will not be considered in this paper, and <^2§ will be treated 
as the definition of Jt{xt). 

To obtain J t and V Jt at a point Xt, we need to solve an optimization problem. Substitute J t +i 
with its approximation Jt+i- To simplify the evaluation of J t+ i(x t +i(x t ,u t ,w k )), we replace 
it with a decision variable J k +i, and add the supporting hyperplane constraints (linear): 



K 

min P ( w t) H x t, ut, w k t ) + Jf+i] 

s.t. g t (x t ,u t ) < 0, 

Jt+i > Jt+i(x J t+1 ) + W J t+ i(x{ +l ) T (x t+l (x u u t ,w k t ) -xi +1 ), for all j, k. 

The above formulation allows us to get Jt{xt), but to obtain VJt(xj), we need to change it 
a bit. We introduce decision variable St as a dummy of xf- 

K 

min P ( w t) t c (^' u u w k t ) + J k +1 ] 

st,u t ,{J t+ i} k=l 

s.t. gt(s t ,u t ) < 0, 

Jt+i > J t+i(4+i) + ^Jt+i(xl +1 ) T (x t+1 (s t ,u t ,w k ) - x J t+1 ), for all j, k, 

s t = x t . (3) 

Solving ^ not only gives us Jt(xt), which is the optimal objective value, but our assumptions 
also guarantee that there exists a Lagrange multiplier vector A associated to the constraint 
St = Xt, and that —A (or A, depending on how you write the Lagrangian function) is a 
subgradient of J t () at Xt- Thus we can obtain VJt{xt) for free by looking at the Lagrange 
multiplier of the constraint St — Xt- 

Note that approximating a function with supporting hyperplanes preserves convexity, thus 
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Figure 1: upper, lower bounds and max error 



([3]) is a convex program, and any local minimum is also a global minimum. 
Adaptive Convex Enveloping 

We just showed how to efficiently obtain J t and VJt to build supporting hyperplanes to 
approximate Jt(xt), but we haven't discussed where to build these hyperplanes. From the 
geometric point of view, the flatter the function in a region, the less supporting hyperplanes 
needed, vice versa. In principle, we want to be as economical as possible, because each 
supporting hyperplane will add a number of linear constraints to ^ in stage t — 1. With- 
out prior knowledge of Jt{x t ), presetting the investigation points {x 3 t } could be wasteful in 
supporting hyperplanes where the function is flat, and insufficient where the function is very 
curved. 

The approach used by A.C.E. — the reason why it got the name — is to learn the shape of 
Jt{x t ) on the way, adding supporting hyperplanes only where necessary. 

Error Control 

The supporting hyperplanes give a lower bound of Jt{x t ) at any x t G X t , and the convexity 
of Jt(xt) can give an upper bound — together they provide an error bound. 

Suppose that our state variable Xt is 1-dimensional and look at Figure [T] We have two 
tangents added at —0.3 and 0.1, respectively. By convexity, for Xt G [—0.3,0.1], Jt{xt) must 



be above the two tangents, and below the segment connecting the two points of tangency. 
Thus if we use the max of the tangents as the approximation of J t {xt), the maximum potential 
error at x t is the vertical distance from the max of the tangents to the segment connecting 
the two points of tangency. And the error of any point in the region [—0.3,0.1] is bounded 
by the maximum potential error at the intersection of the two tangents. 

In general, suppose that the domain X t is a p-dimensional body, and let be p+1 points 

whose convex hull H conv ({x{} P j=i) is also p-dimensional. If we add supporting hyperplanes at 
{x J t }, j — 1, . . . , N (N > p+1), and use the max of these hyperplanes as the approximation 
of Jt{xt), then for any xt € X t fl H conv ({x{}^=\) , the maximum potential error is the vertical 
distance from the max of the supporting hyperplanes to the hyperplane defined by the points 
{(x{, Jt(x{)Yj=f The maximum potential error for the region X t fl H conv ({x{}^l) can be 
found by expressing the point Xt as a convex combination of {x{}^=l and solving the following 
maximization problem: 

p+1 

max y ajJ t (x J t ) — J 

xt,a,J 

3=1 



p+1 

S.t. Xt = ~*y ^ CXjX't , 
3=1 



.1 



x t e x t , 

OLj > 0, for j = 1, ...,p + l, 

P+i 

J2 a i = 



L 5 

3=1 



J > Jt{x{) + VJ t (xl) T (x - xl), for j = 1, . . . , N. (4) 

There are two facts that may counter one's intuition when p > 1. First, the intersection of 
the supporting hyperplanes at {a^}^t| may not be in H conv ({x{}^l); it may not be in X t , 
too, even when H conv ({x{}^l) C X t . Second, a supporting hyperplane at x{, j > p+1, can 
be an active lower bound, even when x{ £ H conv ({x{}^={) . These are the reasons why we 
need to solve ^ to find the potentially worst point x t , and why we use all the supporting 
hyperplanes as constraints. 
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Figure 2: Partitioning at the potentially worst point 
Recursive Partitioning 



Let the optimal solution of (|4|) be (x t , a, J). The point x t is where the maximum potential 
error could occur. If this potential error is larger than the tolerance, we can add a supporting 
hyperplane at x t to reduce the error, and separate H conv ({x{} P 1l{) a t %t into p + 1 smaller 



convex hulls, each with xt and p points from {x{} P t^ as vertices (See Figure |2). Then we face 



r- 



p + 1 subregions whose maximum potential errors can be found by solving (|4]) again. Note 
that if x is on a facet or an edge of H conv ({x{} p 1ll) > some of the subregions will be less than 
p-dimensional. We ignore any subregion that is less than p-dimensional, as it is covered by 
the other p-dimensional subregions. We can tell if a subregion is p-dimensional by looking 
at a. 

To initiate the recursive partitioning algorithm, we need to have p+ 1 initial vertices {xj. } p ={ 
that satisfy two conditions: 1, their convex hull H conv ({x{} p ={) contains X t ; 2, the feasible 
action sets {u t \ gt{x 3 tl u t ) < 0}, j = 1, . . . ,p + 1, are nonempty. The first condition is for 
controlling the error at any point of X t , while the second condition is needed so that we can 
build supporting hyperplanes at these vertices. These points are usually not hard to find. 
For example, if the states have constraints < x t .i < 1, i = 1, . . . ,p, one may want to check 
the intersections of the hyperplanes Xi = 0, i = l,...,p, and Yli=i x i = P to see if they 
satisfy the second condition. However, it's not guaranteed that these points always exist or 
are always easy to find. If one really cannot find these points and has no way around it, 
then one may want to make some compromise and choose that satisfy the second 
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condition and their convex hull covers as much of X t as possible. 



Once we have the initial convex hull to start the algorithm, we separate it into sub-hulls if its 
maximum potential error exceeds the tolerance. Repeat this procedure recursively to all the 
sub-hulls, until all of their maximum potential errors are less than or equal to the tolerance. 

Algorithm 1 Recursive Partitioning 
Loop t from T-l to 1 

Read J t+1 supporting hyperplane information from file; 
Find initial vertices 

Add supporting hyperplanes at {x{}^\ by solving (3); 
Form the first item of the section list with {x{}^ = l; 
currentMaxError = tolerance + 1; 
While (currentMaxError > tolerance) 

If (list size > Budget) 

Print "Budget exceeded."; 
Break; 

End 

currentMaxError = 0; 

iterator = beginning of the list; 

While (iterator != end of the list) 

Solve ©; 

currentMaxError = max (currentMaxError, optimal value of (|4]) ) ; 
If (optimal value of © < tolerance) 

iterator++; 

Else 

Solve ©; 

Add a new supporting hyperplane at xt\ 

Separate the current section at x t to subsections; 

tmp_iterator = iterator; 

iterator++; 

Replace tmp_iterator with new subsections; 
End 

End 

End 

Write J t supporting hyperplane information to file; 
Release memory; 

End 
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Algorithm [T] shows how to solve the dynamic program with recursive partitioning. In the 
algorithm, we call a convex hull a "section", and we use a list of sections. The beginning of 
a list is the list's first item, while the end of a list is a position beyond the last item. An 
iterator of a list is like a pointer that points to an item of the list. The "++" operator points 
the iterator to the next item, but if iterator is the last item of the list, "iterator++" will 
point it to the end of the list, i.e., a position beyond the last item. 

Note that the error we talk about here is the error of J t relative to Jt+\- The absolute error 
of J t is the error relative to J t+ i plus the absolute error of Jt+i- 

Approximating by Importance 

Theoretically, we can approximate a function to any arbitrary high precision by reducing 
the tolerance. In practice, however, doing so to a high dimensional function is prohibitive. 
Meanwhile, pursuing a high precision at every point in X t could be wasteful, since in many 
applications the optimal policy tt tends to guide the process to visit frequently only a small 
portion of the domain. Therefore it is reasonable to only focus on these more important 
small portions. 

We don't know where these small portions are beforehand, however. What we could do is 
to use recursive partitioning to obtain affordable and relatively good approximations of the 
cost-to-go functions, and use simulation to see where the policy guides the process. The 
following is one possible way to update the approximations by importance. 

Observe state path {xt}f = i under the currect policy; 
Loop t from T-l to 1 

Find the section that contains x t ; 

If (maximum potential error > tolerance) 

Add a supporting hyperplane at x t ; 
Separate the current section at x t ; 

End 

End 

This refining process can run as long as needed. Note that this recursive partitioning & 



9 



20 
15 
10 

5 



2 4 6 8 10 12 14 16 

Figure 3: A.C.E. approximation of Jr-i 

approximating by importance approach bypasses the exploration vs. exploitation dilemma, 
which troubles many of today's popular ADP methods. 

Example 

Here is a demonstration that applies A.C.E. to the well-known inventory control problem. 
Let the purchasing cost, penalty and holding cost be 2.0, 4.0 and 0.2, respectively. Let the 
demand be uniform on [0, 10], and let the sample be 0.0, 0.1,. ..,9.9. The cost-to-go function 
is known to be convex, which also could be confirmed by checking the convexity conditions 
given at the beginning of this paper. The optimal policy is well known, that is, to order 
and increase the inventory up to a level S, but the value of S is not known, and need to be 
computed numerically. 

The following are the approximations given by A.C.E. for x G [0, 15], t from T — 1 to T — 10, 
at tolerance = 0.1. is assumed to be zero. 

The approximation of Jt-i is shown by Figure [3j The optimal policy is to order up to 
5 = 4. For the region x < 4 and x > 10, we know Jt-i{x) should be linear, and we see 
A.C.E. didn't waste supporting hyperplanes (tangents) there. For the region 4 < x < 10, it 



S = 4 
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Figure 4: A.C.E. approximation of Jt-2 and Jt-xo 

can be shown that Jt-i(x) is quadratic under a demand from uniform distribution, and the 
approximation reflects the shape closely. Overall, the approximation has an error less than 
0.058 at any point x G [0, 15]. 

Figure [4] shows the approximations of Jt-2 and Jt-w- The optimal policy at stage T — 2 is 
to order up to S = 8. Starting (backwards) from stage T — 3, the optimal policy becomes 
fixed, and we always order up to S = 9. The error accumulates. At stage T — 10, the error 
bound for a point x G [0, 15] is 0.558. But, the cost-to-go accumulates, too, and compared 
to the scale of this target, the error is pretty small. 

The whole process took 0.87 second on my computer. Since this is a simple 1-dimensional 
example, approximating by importance is not needed. 

Features of A.C.E. 

• Solves convex DP with multivariate continuous states and actions; 

• It is a standardized, general purpose method (no parameter tunning, basis function 
choosing, kernel designing, etc.); 

• No assumption on the form of the cost-to-go functions — the form is learned on the 

way; 

• Implementation of mathematical programming allows large number of action variables; 
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The supporting hyperplane approximation preserves convexity of the cost-to-go func- 
tions, enabling reliable optimization; 

The supporting hyperplane approximation also "ignores" the state variables that do not 
add to the nonlinearity of the target function, e.g., approximating f(x\,X2, X3) = x\+x% 
is no more difficult than approximating f(x) = x 2 , although the dimension is higher; 

The computation is relatively light since we obtain the subgradients for free from the 
Lagrange multipliers. 

The supporting hyperplanes are constructed economically by need and importance, 
further reducing the computation; 

There are also problem dependent techniques that could greatly speed up the compu- 
tation. For instance, if c t and g t are piecewise linear, we can use dual simplex updates 
to obtain the supporting hyperplanes, instead of solving ^ from scratch, because ^ 
is now a linear program and the state Xt is a RHS; 

We know an upper bound (not the big-0 notation) of the error between the approx- 
imated and the true cost-to-go functions at any state, thus we know how the policy 
performs; 

There is no exploration vs. exploitation dilemma; 

Compared to stochastic programming, it stands out when solving problems with long 
planning horizons, as the CPU time grows linearly in the number of stages, and memory 
consumption remains the same; plus, it is dynamic. 
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